In this lab exercise, we analyse a bank marketing dataset covering customers who were offered deposit schemes. The goal is to help bank representatives understand and improve how customers are selected for such campaigns.
Bank Marketing Dataset:
The research paper associated with this data can be found in the following journal:
Paper Citation: Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22-31. https://dx.doi.org/10.1016/j.dss.2014.03.001
"Deposits" - The term that helps the bank function smoothly. Every bank tries to get their customer to invest in different deposit schemes in their bank. They give good competetive interest rates and offers to keep the customer tied to the schemes. They promote such schemes through various mediums like cellular, telephonic, in-person, emails etc. Every bank tries to analyse various customer factors like their credit history, balances, salary, education etc. when defining the best rates that can be offered.
There is a marketing team that looks after this part of business. They find various ways to communicate the benefits of the deposit scheme to the customers. Some invest and some don't. This Bank Marketing dataset tries to look at all these factors as a whole to predict if a customer would invest or not. Thus, pro-active measures can be taken in the future to on board the right set of customers.
The Bank Marketing dataset has 11162 rows and 17 attributes, including both numerical and categorical variables. It originally comes from the UCI Machine Learning Repository but is also available on Kaggle, and can be downloaded for free from either site.
Analyzing and visualizing this information helps the bank target the right customers with the right deposit deals and terms, which should give better results than choosing customers at random. This is a great step toward saving the bank time and resources, and it gives a clear direction for the next deposit campaign cycle.
Which next customer would deposit an amount in the bank? Which customers should be included in the bank deposit strategy?
#Libraries used
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import umap
#Read the dataset
df = pd.read_csv('bank.csv')
df.shape
(11162, 17)
#Categorize customers by whether they made a deposit
df.groupby('deposit')['deposit'].count()
deposit
no     5873
yes    5289
Name: deposit, dtype: int64
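The same class-balance check can be written more idiomatically with `value_counts`; a minimal sketch on a toy frame, since the real `df` is loaded from `bank.csv` above:

```python
import pandas as pd

# Toy stand-in for the bank data; in the notebook, df comes from bank.csv
toy = pd.DataFrame({'deposit': ['no', 'no', 'yes', 'no', 'yes']})

counts = toy['deposit'].value_counts()               # absolute counts per class
rates = toy['deposit'].value_counts(normalize=True)  # class proportions

print(counts['no'], counts['yes'])   # 3 2
print(round(rates['yes'], 2))        # 0.4
```

With `normalize=True` the same call also reveals whether the classes are roughly balanced, which matters for any later modelling step.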
Check whether the dataset contains any duplicate rows or missing values.
#Check For Duplicates
df.duplicated().unique()
array([False])
# Check for null/missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11162 entries, 0 to 11161
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        11162 non-null  int64
 1   job        11162 non-null  object
 2   marital    11162 non-null  object
 3   education  11162 non-null  object
 4   default    11162 non-null  object
 5   balance    11162 non-null  int64
 6   housing    11162 non-null  object
 7   loan       11162 non-null  object
 8   contact    11162 non-null  object
 9   day        11162 non-null  int64
 10  month      11162 non-null  object
 11  duration   11162 non-null  int64
 12  campaign   11162 non-null  int64
 13  pdays      11162 non-null  int64
 14  previous   11162 non-null  int64
 15  poutcome   11162 non-null  object
 16  deposit    11162 non-null  object
dtypes: int64(7), object(10)
memory usage: 1.4+ MB
Referring to the stats above, there are no duplicate rows in the dataset (the duplicate check returns an array containing only False), so no rows need to be eliminated.
There are also no null values present. However, some columns contain the entry "unknown", which can be considered missing in some sense. The next checks determine whether these entries affect our analysis and how they should be handled.
# Count the number of unknowns in every data column
unknown_count = {}
for col in df.columns:
count = (df[col] == 'unknown').sum()
unknown_count[col] = count
unknown_count_df = pd.DataFrame(list(unknown_count.items()),columns=['column','unknown_counts'])
unknown_count_df
| | column | unknown_counts |
|---|---|---|
| 0 | age | 0 |
| 1 | job | 70 |
| 2 | marital | 0 |
| 3 | education | 497 |
| 4 | default | 0 |
| 5 | balance | 0 |
| 6 | housing | 0 |
| 7 | loan | 0 |
| 8 | contact | 2346 |
| 9 | day | 0 |
| 10 | month | 0 |
| 11 | duration | 0 |
| 12 | campaign | 0 |
| 13 | pdays | 0 |
| 14 | previous | 0 |
| 15 | poutcome | 8326 |
| 16 | deposit | 0 |
We can see that poutcome (the outcome of the previous campaign) is unrecorded for too large a share of the rows to support any sound imputation, and imputing at that scale could distort the statistics. Thus, we ignore this column in the analysis going ahead.
The other three columns with unknowns (job, education, and contact) are categorical fields, so rather than imputing values we treat "unknown" as a separate category throughout our analysis.
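A minimal sketch of that handling strategy — drop poutcome, keep "unknown" as its own category — on a small stand-in frame (column names match this dataset; the rows here are illustrative only):

```python
import pandas as pd

# Illustrative stand-in rows; the notebook's df holds the real data
toy = pd.DataFrame({
    'job':       ['admin.', 'unknown', 'technician'],
    'education': ['primary', 'secondary', 'unknown'],
    'poutcome':  ['unknown', 'success', 'unknown'],
})

# poutcome is mostly 'unknown', so drop it rather than impute at that scale
toy = toy.drop(columns=['poutcome'])

# For job/education, 'unknown' simply stays as one more category
toy['job'] = toy['job'].astype('category')

print(list(toy.columns))                       # ['job', 'education']
print('unknown' in toy['job'].cat.categories)  # True
```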
#Renaming the columns
df.rename(columns = {'housing':'housing_loan','loan':'personal_loan','default':'credit_default','contact':'contact_mode','pdays':'days_since_last_contact'},inplace=True)
#replace (yes,no) in deposit, housing_loan, personal_loan and credit default status with (0,1)
# A vectorized map is faster and clearer than a per-row apply
df['deposit'] = df['deposit'].map({'no': 0, 'yes': 1})
df['housing_loan'] = df['housing_loan'].map({'no': 0, 'yes': 1})
df['personal_loan'] = df['personal_loan'].map({'no': 0, 'yes': 1})
df['credit_default'] = df['credit_default'].map({'no': 0, 'yes': 1})
df.describe()
| | age | credit_default | balance | housing_loan | personal_loan | day | duration | campaign | days_since_last_contact | previous | deposit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 | 11162.000000 |
| mean | 41.231948 | 0.015051 | 1528.538524 | 0.473123 | 0.130801 | 15.658036 | 371.993818 | 2.508421 | 51.330407 | 0.832557 | 0.473840 |
| std | 11.913369 | 0.121761 | 3225.413326 | 0.499299 | 0.337198 | 8.420740 | 347.128386 | 2.722077 | 108.758282 | 2.292007 | 0.499338 |
| min | 18.000000 | 0.000000 | -6847.000000 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 |
| 25% | 32.000000 | 0.000000 | 122.000000 | 0.000000 | 0.000000 | 8.000000 | 138.000000 | 1.000000 | -1.000000 | 0.000000 | 0.000000 |
| 50% | 39.000000 | 0.000000 | 550.000000 | 0.000000 | 0.000000 | 15.000000 | 255.000000 | 2.000000 | -1.000000 | 0.000000 | 0.000000 |
| 75% | 49.000000 | 0.000000 | 1708.000000 | 1.000000 | 0.000000 | 22.000000 | 496.000000 | 3.000000 | 20.750000 | 1.000000 | 1.000000 |
| max | 95.000000 | 1.000000 | 81204.000000 | 1.000000 | 1.000000 | 31.000000 | 3881.000000 | 63.000000 | 854.000000 | 58.000000 | 1.000000 |
Save a copy of the original dataset for reference before binning the numeric age and balance columns into ranges for any data reduction/modelling purposes.
df_original = df.copy()
df_original.head(3)
| | age | job | marital | education | credit_default | balance | housing_loan | personal_loan | contact_mode | day | month | duration | campaign | days_since_last_contact | previous | poutcome | deposit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | admin. | married | secondary | 0 | 2343 | 1 | 0 | unknown | 5 | may | 1042 | 1 | -1 | 0 | unknown | 1 |
| 1 | 56 | admin. | married | secondary | 0 | 45 | 0 | 0 | unknown | 5 | may | 1467 | 1 | -1 | 0 | unknown | 1 |
| 2 | 41 | technician | married | secondary | 0 | 1270 | 1 | 0 | unknown | 5 | may | 1389 | 1 | -1 | 0 | unknown | 1 |
# Bin the continuous age values into categorical age ranges
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100] # Define your age intervals here
#labels for the age intervals
labels = ['0-10','11-20' ,'21-30', '31-40', '41-50', '51-60', '61-70', '71-80', '81-90', '91-100']
#Create a new column with age intervals
df['age_range'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
# Display the transformed DataFrame
df = df[['age_range', 'job', 'marital', 'education', 'credit_default', 'balance',
'housing_loan', 'personal_loan', 'contact_mode', 'day', 'month',
'duration', 'campaign', 'days_since_last_contact', 'previous',
'poutcome', 'deposit']]
# Bin the continuous balance values into categorical ranges
# Use binning to identify the right number of bins for this column:
# create a histogram and inspect it to finalize the bin edges
plt.hist(df['balance'], bins='auto', edgecolor='k')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram with Auto Binning')
plt.show()
#bins = [500,1000,1500,2000]
We can see that the data spans a huge range, but most values are concentrated at the lower end. Thus, we use bins of varied sizes to accommodate the variance in the data.
bins = [0,10,5000,15000,45000,90000]
#labels for the balance ranges
labels = ['0-10','11-5000','5001-15000','15001-45000','45001-90000']
# Floor negative balances at zero so every row falls into a bin
df['balance'] = df['balance'].clip(lower=0)
#Create a new column with balance ranges
df['balance_range'] = pd.cut(df['balance'], bins=bins, labels=labels, right=False)
df[df['balance_range'].isna()]  # verify that no row fell outside the bins
#Display the transformed DataFrame
df = df[['age_range', 'job', 'marital', 'education', 'credit_default', 'balance_range',
'housing_loan', 'personal_loan', 'contact_mode', 'day', 'month',
'duration', 'campaign', 'days_since_last_contact', 'previous',
'poutcome', 'deposit']]
| Index | Feature | Description | Type | Range |
|---|---|---|---|---|
| 1 | age | Age of the customer | Integer | 18-95 |
| 2 | job | Job of the customer | Categorical | admin, technician, unemployed.. etc |
| 3 | marital | Marital Status | Categorical | Married, Single, divorced, unknown. |
| 4 | education | Education level of the customer | Categorical | High School, professional, illiterate etc. |
| 5 | credit_default | Customer Credit default status | Binary | Yes/No |
| 6 | balance | Customer's average yearly account balance | Numeric | -6847 - 81204 |
| 7 | housing_loan | Whether the customer has a housing loan (Yes/No) | Binary | 0/1 |
| 8 | personal_loan | Whether the customer has any personal loan (Yes/No) | Binary | 0/1 |
| 9 | contact_mode | Mode of promotion contact | Categorical | Cellular/Telephonic |
| 10 | day | Day of contact month | Numeric | 1-31 |
| 11 | month | Month of contact | Categorical | Jan-Dec |
| 12 | duration | last contact duration in seconds | Numeric | 2 - 3881 |
| 13 | campaign | Number of contacts to the customer during the campaign | Numeric | 1 - 63 |
| 14 | days_since_last_contact | Days passed since last contact (-1 indicates no previous contact) | Numeric | -1 - 854 |
| 15 | previous | Number of contacts with the customer before this campaign | Numeric | 0 - 58 |
| 16 | poutcome | Outcome of the previous marketing campaign | Categorical | unknown, other, failure, success |
| 17 | deposit | Whether the customer made a deposit (Yes/No) | Binary | 0/1 |
As discussed, our aim is to have customers open a deposit with the bank. For that, the marketing team aims to select and target a specific group of customers. To identify the best set of customers to choose, we try the following visualizations.
df_age = df.groupby(by='age_range')['deposit'].agg(['mean','count']).reset_index()
df_age
# Plot a bar chart showing the deposit rates in every range
sns.set(style='whitegrid')
plt.figure(figsize=(10,6))
ax1 = sns.barplot(x='age_range',y='mean',data=df_age,color='skyblue',label='deposit ratio')
plt.xlabel('Age Range')
plt.ylabel('Deposit Rate')
plt.title('Deposit rate & customer count by age range')
plt.xticks(rotation=45)
# Add a second y-axis showing the count of customers
ax2 = ax1.twinx()
sns.lineplot(x='age_range',y='count',data=df_age,ax=ax2,marker='o',color='orange',label='Customer count')
ax1.set_ylabel('Deposit Rate')
ax2.set_ylabel('Customer Count')
# Display legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.tight_layout()
plt.show()
The deposit rate & customer count by age range chart shows the share of positive deposits in every age bucket. As a whole, the 91-100 bucket has the best deposit rate, but its customer count is so small that no conclusion can be drawn from it.
We must weigh both the deposit rate and the customer count to reach a solid conclusion. Among the well-populated buckets, 21-30 has the best deposit rate and 31-40 the worst. Thus, the 21-30 age group has invested in deposits the most through these promotion campaigns, while the 31-40 group is investing the least.
Let us dig deeper into the 31-40 bucket to investigate the issue.
df[df['age_range'] == '31-40'].groupby(by=['job','balance_range']).job.agg(['count']).reset_index()
| | job | balance_range | count |
|---|---|---|---|
| 0 | admin. | 0-10 | 88 |
| 1 | admin. | 11-5000 | 440 |
| 2 | admin. | 5001-15000 | 17 |
| 3 | admin. | 15001-45000 | 2 |
| 4 | admin. | 45001-90000 | 0 |
| 5 | blue-collar | 0-10 | 134 |
| 6 | blue-collar | 11-5000 | 582 |
| 7 | blue-collar | 5001-15000 | 29 |
| 8 | blue-collar | 15001-45000 | 5 |
| 9 | blue-collar | 45001-90000 | 0 |
| 10 | entrepreneur | 0-10 | 24 |
| 11 | entrepreneur | 11-5000 | 94 |
| 12 | entrepreneur | 5001-15000 | 7 |
| 13 | entrepreneur | 15001-45000 | 0 |
| 14 | entrepreneur | 45001-90000 | 0 |
| 15 | housemaid | 0-10 | 8 |
| 16 | housemaid | 11-5000 | 52 |
| 17 | housemaid | 5001-15000 | 4 |
| 18 | housemaid | 15001-45000 | 1 |
| 19 | housemaid | 45001-90000 | 0 |
| 20 | management | 0-10 | 158 |
| 21 | management | 11-5000 | 934 |
| 22 | management | 5001-15000 | 85 |
| 23 | management | 15001-45000 | 6 |
| 24 | management | 45001-90000 | 0 |
| 25 | retired | 0-10 | 0 |
| 26 | retired | 11-5000 | 3 |
| 27 | retired | 5001-15000 | 0 |
| 28 | retired | 15001-45000 | 0 |
| 29 | retired | 45001-90000 | 0 |
| 30 | self-employed | 0-10 | 15 |
| 31 | self-employed | 11-5000 | 140 |
| 32 | self-employed | 5001-15000 | 9 |
| 33 | self-employed | 15001-45000 | 2 |
| 34 | self-employed | 45001-90000 | 0 |
| 35 | services | 0-10 | 78 |
| 36 | services | 11-5000 | 323 |
| 37 | services | 5001-15000 | 16 |
| 38 | services | 15001-45000 | 1 |
| 39 | services | 45001-90000 | 0 |
| 40 | student | 0-10 | 4 |
| 41 | student | 11-5000 | 57 |
| 42 | student | 5001-15000 | 4 |
| 43 | student | 15001-45000 | 0 |
| 44 | student | 45001-90000 | 0 |
| 45 | technician | 0-10 | 123 |
| 46 | technician | 11-5000 | 680 |
| 47 | technician | 5001-15000 | 50 |
| 48 | technician | 15001-45000 | 8 |
| 49 | technician | 45001-90000 | 1 |
| 50 | unemployed | 0-10 | 13 |
| 51 | unemployed | 11-5000 | 102 |
| 52 | unemployed | 5001-15000 | 7 |
| 53 | unemployed | 15001-45000 | 0 |
| 54 | unemployed | 45001-90000 | 0 |
| 55 | unknown | 0-10 | 5 |
| 56 | unknown | 11-5000 | 6 |
| 57 | unknown | 5001-15000 | 1 |
| 58 | unknown | 15001-45000 | 0 |
| 59 | unknown | 45001-90000 | 0 |
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df_fil = df[df['age_range'] == '31-40']
df_fil = df_fil[['job','balance_range','deposit']]
# Pivot the data: Balance_Range as rows, Job as columns, and the mean deposit rate (as %) as values
pivot_table = df_fil.pivot_table(index='balance_range', columns='job', values='deposit', aggfunc='mean', fill_value=0) * 100
# Create a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, fmt='.1f', cmap='YlGnBu')
plt.title('Deposits by Balance Range and Job')
plt.xlabel('Job')
plt.ylabel('Balance Range')
plt.show()
The heatmap shows that for the 31-40 age group, a high percentage of customers across different job roles invest in the schemes; what mainly matters is the balance they hold in their accounts.
Customers in the 11-5000 and 5001-15000 balance ranges invest the most in the schemes. The other balance groups show no consistent investment rate across job groups. Thus, for this age group, the combination of balance and job plays a major role in the deposit percentage.
# Create subplots for each factor
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
# Plot Credit_Default vs. Deposit
sns.countplot(x='credit_default', hue='deposit', data=df, ax=axes[0])
axes[0].set_title('Credit Default vs Deposit')
# Plot Personal_Loan vs. Deposit
sns.countplot(x='personal_loan', hue='deposit', data=df, ax=axes[1])
axes[1].set_title('Personal loan vs Deposit')
# Plot Housing_Loan vs. Deposit
sns.countplot(x='housing_loan', hue='deposit', data=df, ax=axes[2])
axes[2].set_title('Housing loan vs Deposit')
# Add legend to each subplot
for ax in axes:
ax.legend(title='Deposit', labels=['No', 'Yes'])
# Adjust spacing between subplots
plt.tight_layout()
# Show the plots
plt.show()
Looking at the three comparisons, credit defaults and personal loans are rare in this dataset (means of about 0.015 and 0.13 respectively), while the housing-loan split shows a visible difference in deposit behaviour between the two groups.
Thus, we can conclude that the housing loan is a parameter customers weigh when considering a deposit scheme.
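That visual reading can be backed with numbers: since deposit is coded 0/1, the group mean is the deposit rate. A sketch on toy data (on the real frame, the equivalent call would be `df.groupby('housing_loan')['deposit'].mean()`):

```python
import pandas as pd

# Toy 0/1-coded rows standing in for the transformed df
toy = pd.DataFrame({
    'housing_loan': [0, 0, 0, 1, 1, 1],
    'deposit':      [1, 1, 0, 0, 0, 1],
})

# Mean of the 0/1 deposit flag within each group = deposit rate
rate = toy.groupby('housing_loan')['deposit'].mean()
print(round(rate[0], 2))  # 0.67 -> customers without a housing loan
print(round(rate[1], 2))  # 0.33 -> customers with a housing loan
```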
df_education = df.groupby(by='education')['deposit'].agg(['mean','count']).reset_index()
df_education
# Plot a bar chart showing the deposit rates in every range
sns.set(style='whitegrid')
plt.figure(figsize=(10,6))
ax1 = sns.barplot(x='education',y='mean',data=df_education,color='lightgreen',label='deposit ratio')
plt.xlabel('Education')
plt.ylabel('Deposit Rate')
plt.title('Deposit rate & customer count by Education')
plt.xticks(rotation=45)
# Add a second y-axis showing the count of customers
ax2 = ax1.twinx()
sns.lineplot(x='education',y='count',data=df_education,ax=ax2,marker='o',color='blue',label='Customer count')
ax1.set_ylabel('Deposit Rate')
ax2.set_ylabel('Customer Count')
# Display legends
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.tight_layout()
plt.show()
From the deposit rate and customer count graph, we can see that customers who completed secondary or tertiary education opt for a deposit scheme at a higher rate than customers with only primary education.
Thus, when strategizing the next deposit cycle, customers with tertiary, followed by secondary, education can be targeted a little more.
# Select the columns for boxplot visualization
selected_columns = ['duration', 'campaign']
# Create subplots for each selected variable
plt.figure(figsize=(12, 5))
for i, col in enumerate(selected_columns, 1):
plt.subplot(1, 2, i)
sns.boxplot(data=df, x='deposit', y=col, palette='Set2')
plt.title(f'Box Plot for {col}')
plt.xlabel('Deposit Status')
plt.ylabel(col)
plt.tight_layout()
plt.show()
The box plot for duration indicates that customers who deposited had significantly longer calls, which ultimately reflects their interest in the scheme; the upper quartile for the "yes" cases spans a wider range than the lower quartile. Thus, the more engaged the customer is on the call, the higher the chance they will invest in the deposit.
The box plot for campaign indicates no significant effect from contacting the same customers multiple times. The quartiles are almost identically distributed for both outcomes, so increasing the number of contact attempts makes no significant difference.
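Group medians give a robust numeric check on this box-plot reading; a sketch on toy rows shaped like the duration/campaign columns (the values here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy rows standing in for the real df
toy = pd.DataFrame({
    'deposit':  [0, 0, 0, 1, 1, 1],
    'duration': [100, 150, 200, 400, 500, 600],
    'campaign': [2, 3, 2, 2, 3, 2],
})

medians = toy.groupby('deposit')[['duration', 'campaign']].median()
# Depositors show much longer calls, but the same number of contact attempts
print(medians.loc[1, 'duration'] > medians.loc[0, 'duration'])   # True
print(medians.loc[1, 'campaign'] == medians.loc[0, 'campaign'])  # True
```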
df_correlation = df[['credit_default','housing_loan','personal_loan','day','duration','campaign','days_since_last_contact','previous']]
corr_matrix = df_correlation.corr()
plt.figure(figsize = (10,6))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm',linewidth=0.5)
#Add labels
plt.title('Correlation heatmap')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
From the heatmap, we can observe that correlation between the columns is not much of a concern: each column appears more or less independent of the others, with the highest correlation factor of 51% observed between the number of previous contacts and the total duration of the contact call.
Credit default, personal loan, housing loan, contact day of month, and campaign contacts show positive correlations with each other, whereas previous contacts, contact duration, and days since last contact show negative correlations with each other.
We use dimensionality reduction primarily to see whether the problem can still be solved after squeezing down the number of dimensions/features involved. Before applying this technique, we must understand the trade-off made on features in exchange for computational speed.
#Select the numerical dependent data where dimensions can be reduced.
df_dimensions = df_original[['age','balance','campaign', 'days_since_last_contact', 'previous','duration']]
df_dimensions.head(3)
| | age | balance | campaign | days_since_last_contact | previous | duration |
|---|---|---|---|---|---|---|
| 0 | 59 | 2343 | 1 | -1 | 0 | 1042 |
| 1 | 56 | 45 | 1 | -1 | 0 | 1467 |
| 2 | 41 | 1270 | 1 | -1 | 0 | 1389 |
## Visualizing the original data before PCA
from sklearn.preprocessing import StandardScaler
numeric = df_dimensions.columns
#Visualize the original data
plt.figure(figsize=(12, 6))
for i, feature in enumerate(numeric, 1):
plt.subplot(2, 3, i)
plt.scatter(df_original[feature], df_original['deposit'], alpha=0.5)
plt.title(f'Original {feature} vs Deposit Status')
plt.xlabel(feature)
plt.tight_layout()
plt.show()
From the plots above, we can see how each numeric feature is distributed across the two deposit outcomes.
#Scale the data
scaler = StandardScaler()
scaled_data= scaler.fit_transform(df_dimensions)
df_scaled = pd.DataFrame(data=scaled_data, columns=numeric)
# Apply PCA
from sklearn.decomposition import PCA  # StandardScaler was already imported above
pca = PCA(n_components=3) # We'll keep the first 3 principal components for visualization
X_pca = pca.fit_transform(scaled_data)
import seaborn as sns
cmap = sns.set(style="darkgrid")
# this function definition just formats the weights into readable strings
def get_feature_names_from_weights(weights, names):
    tmp_array = []
    for comp in weights:
        tmp_string = ''
        for fidx, f in enumerate(names):
            if fidx > 0 and comp[fidx] >= 0:
                tmp_string += '+'
            tmp_string += '%.2f*%s ' % (comp[fidx], f)  # use the full feature name
        tmp_array.append(tmp_string)
    return tmp_array
plt.style.use('default')
# Analyse to see how the components looks
pca_weight_strings = get_feature_names_from_weights(pca.components_, df_scaled.columns)
# transformed output dataframe
df_pca = pd.DataFrame(X_pca,columns=[pca_weight_strings])
# Scatter plot to observe the reduced dimensions
color_dict = {0: 'green', 1: 'blue'}
point_colors = [color_dict[status] for status in df['deposit']]
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=point_colors)
plt.xlabel(pca_weight_strings[0])
plt.ylabel(pca_weight_strings[1])
plt.title('Scatter plot for reduced dimensions')
plt.show()
From the scatter plot for reduced dimensions, we can see both deposit outcomes plotted.
However, there is a good proportion of overlap between the two classes in the reduced features, so making a clear distinction at this point in a reduced dimension is quite difficult. We could test alternate methods or add more discriminative features to sharpen this distinction.
#Visualize the variance tradeoff involved when reducing each dimensions
import numpy as np
def plot_explained_variance(pca):
import plotly
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
plotly.offline.init_notebook_mode() # run at the start of every notebook
explained_var = pca.explained_variance_ratio_
cum_var_exp = np.cumsum(explained_var)
plotly.offline.iplot({
"data": [Bar(y=explained_var, name='individual explained variance'),
Scatter(y=cum_var_exp, name='cumulative explained variance')
],
"layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
})
pca = PCA(n_components=6)
pca.fit(scaled_data)
plot_explained_variance(pca)
We can see and interpret from the above plot (Fig 2) that only about 45% of the variance is covered by 2 dimensions, and around 65% when reduced to 3.
Also, since the 2nd, 3rd, and 4th components contribute almost equally to the variance, ignoring any of them risks missing important features. We can still classify the information, but the accuracy might not be the best.
Given that the use case is to identify and prioritize customers for deposit schemes, we can attempt a model on this reduced data and evaluate its performance to analyse the effect.
# Inspect the computed eigenvectors
eigenvectors = pca.components_
eigenvectors
array([[ 0.04714801, 0.07496487, -0.19556165, 0.69376745, 0.68552947,
-0.05146893],
[ 0.67745748, 0.69351567, -0.09534339, -0.07966926, -0.05365854,
0.20437853],
[ 0.22780032, 0.08849141, 0.62532506, 0.01400722, 0.08359068,
-0.73624809],
[-0.12833079, 0.06771689, 0.74464645, 0.08676539, 0.1727498 ,
0.62215478],
[-0.68557671, 0.70765288, -0.03691687, -0.02060104, -0.03211298,
-0.16246066],
[-0.02124177, -0.01507903, -0.07597802, -0.71006387, 0.69950624,
-0.00700582]])
# Plot a heatmap to understand how each original dimension loads onto each eigenvector
plt.figure(figsize=(10, 6))
# Rows of pca.components_ are the eigenvectors, so transpose to put them on the x-axis
plt.imshow(pca.components_.T, cmap='viridis', aspect='auto')
plt.colorbar(label='Eigenvector Value')
plt.xticks(np.arange(6), [f'Eigenvector {i+1}' for i in range(6)])
plt.xlabel('Eigenvectors')
plt.ylabel('Dimensions')
plt.title('Heatmap of Eigenvector Values')
plt.show()
The heatmap of eigenvector values helps us interpret each vector's influence across the dimensional space. Eigenvectors 1, 4, and 6 have strong negative loadings in the 5th, 6th, and 3rd dimensions respectively, whereas eigenvector 2 has strong positive loadings in the 2nd and 5th dimensions. The remaining entries are relatively evenly distributed across the vectors involved in the reduction.
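The same loadings read more easily as a labeled table. A sketch with synthetic stand-in data (the feature names match this dataset's numeric columns; in the notebook, scaled_data would be passed instead):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

features = ['age', 'balance', 'campaign',
            'days_since_last_contact', 'previous', 'duration']

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 6))  # stand-in for scaled_data

pca = PCA(n_components=6).fit(X)
loadings = pd.DataFrame(
    pca.components_,
    columns=features,
    index=[f'Eigenvector {i + 1}' for i in range(6)],
)
# Each row is a unit-length eigenvector: its squared loadings sum to 1
print(np.allclose((loadings ** 2).sum(axis=1), 1.0))  # True
```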
We also explored how UMAP behaves when given the same dataset as input, to understand the process UMAP follows:
# get the numeric columns
df_umap = df_original[['age','balance','campaign', 'days_since_last_contact', 'previous','duration']]
# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_umap)
# Model with n_components
umap_model = umap.UMAP(n_neighbors=5, n_components=2) # Adjust parameters as needed
umap_result = umap_model.fit_transform(scaled_data)
# Map the output datapoints to their respective colors to understand the reduced distribution
color_dict = {0: 'green', 1: 'blue'}
point_colors = [color_dict[status] for status in df_original['deposit']]
plt.scatter(umap_result[:, 0], umap_result[:, 1], c=point_colors)
plt.title('UMAP Projection')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.show()
The UMAP projection indicates that the data points share similar attributes in their original dimensional space, which means some of the information separating the classes was lost during the reduction. However, a slight pattern is still observable, which could be useful for the classification step of this use case.